Space-efficient Data Structures for String Searching and Retrieval
Author
Abstract
Chapter 1: Introduction
  1.1 The Models of Computation
  1.2 Our Contributions
  1.3 Roadmap
Chapter 2: Preliminaries
  2.1 Generalized Suffix Tree
  2.2 Suffix Array
  2.3 Compressed Suffix Arrays
  2.4 Bit Vectors with Rank/Select Support
  2.5 Succinct Representation of Ordinal Trees
  2.6 Document Array
  2.7 Differentially Encoding a Sorted Array
  2.8 String B-tree
Chapter 3: External Memory Data Structures
  3.1 Preliminary: Top-k Framework
  3.2 External Memory Structures
    3.2.1 Breaking Down into Sub-Problems
    3.2.2 Converting Top-k to Threshold via Logarithmic Sketch
    3.2.3 Special Structures for Bounded k
    3.2.4 I/O-Optimal Data Structure via Bootstrapping
  3.3 Adapting to Internal Memory
Chapter 4: Succinct Space Data Structures
  4.1 Related Work
  4.2 Our Data Structure
    4.2.1 The Compressed Data Structure
    4.2.2 Faster Compressed Data Structure
  4.3 Extensions
Chapter 5: Compact Space Data Structures
  5.1 Related Work
  5.2 The Data Structure
  5.3 Storing and Retrieving the Lists top(x, z)
  5.4 Completing the Picture
    5.4.1 Query Answering
    5.4.2 Computing Scores Online
  5.5 Reducing the Time to O(p + k log* k)
Chapter 6: Multipattern Retrieval
  6.1 Handling m > 2 Patterns
Chapter 7: Conclusions
References
Vita
Similar resources
Indexing Textual Information
Information retrieval is the computational discipline that deals with the efficient representation, organization, and access to information objects that represent natural language texts (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1983; Witten, Moffat, & Bell, 1999). A crucial subproblem in the information retrieval area is the design and implementation of efficient data structures and a...
Upper and Lower Bounds for Text Indexing Data Structures
The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices that occupy minute space (FM-indexes) and for wavelet trees. These data structures can also be used to represent labeled trees and posting lists. Labeled trees are applied in XM...
INSTRUCT - Space-Efficient Structure for Indexing and Complete Query Management of String Databases
The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure,...
siEDM: an efficient string index and search algorithm for edit distance with moves
Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has...
Phrase Based Document Retrieving by Combining Suffix Tree index data structure and Boyer-Moore faster string searching algorithm
Phrases have been considered more informative feature terms for improving the effectiveness of document retrieval. This paper proposes an algorithm for phrase-based document retrieval that retrieves similar documents by combining two existing algorithms: the suffix tree, an index data structure, and the Boyer-Moore algorithm, a faster string searching algorithm. The suffix tree is constructed based on E. ...
Publication date: 2014